The goal of this assignment is to use linear regression for both single and multiple variables. You will do the following:
Calculate single variable linear regression and assess its quality
Compare errors for training and test set
Use polynomial regression and assess its quality
Check for overfitting and select appropriate model
Heart rates
The following data set contains heart rates measured and reported by students in my class Introduction to Quantitative Modeling for Biology. There are four different heart rates measured (two at rest and two after exercise) and the year it was measured.
Rows: 1068 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): Rest1, Ex1, Rest2, Ex2, Year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Select a response and an explanatory variable (from the 4 heart rate variables) and clean the data to remove any outliers or missing values in these variables. Split the data set into training and test sets of equal size.
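The cleaning-and-splitting code is not shown in the rendered output above; a minimal sketch might look like the following. The small `heart` data frame here is a synthetic stand-in for the real CSV, and the plausibility cutoffs (30–150 bpm at rest, 30–250 bpm after exercise) are illustrative assumptions, not the cutoffs used in the original analysis.

```r
# Stand-in for the real data: in the assignment, `heart` comes from read_csv().
heart <- data.frame(Rest1 = c(60, 72, NA, 65, 300, 80, 70, 75),
                    Ex1   = c(120, 140, 130, NA, 150, 160, 125, 135),
                    Rest2 = c(62, 70, 68, 66, 64, 310, 72, 74),
                    Ex2   = c(118, 142, 128, 131, 149, 158, NA, 137),
                    Year  = 2020)

# Keep only rows where the chosen variables are present and plausible;
# the cutoffs are illustrative assumptions, not from the original analysis.
heart_clean <- subset(heart, !is.na(Rest2) & !is.na(Ex2) &
                             Rest2 > 30 & Rest2 < 150 &
                             Ex2   > 30 & Ex2   < 250)

set.seed(42)                               # reproducible split
train_idx   <- sample(nrow(heart_clean), size = nrow(heart_clean) %/% 2)
heart_train <- heart_clean[train_idx, ]
heart_test  <- heart_clean[-train_idx, ]
```

Sampling row indices without replacement and taking the complement guarantees the two halves partition the cleaned data.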
Use linear regression on the training set and print out the summary. Make a scatterplot and overlay the linear regression model as a line on the same plot. Make a separate plot of the residuals on the training set, and assess whether it looks good (like a shapeless blob). Based on the summary, explain which parameters are significantly different from zero, according to the hypothesis test.
lm_out <- lm(Ex2 ~ Rest2, data = heart_train)
summary(lm_out)
Call:
lm(formula = Ex2 ~ Rest2, data = heart_train)
Residuals:
Min 1Q Median 3Q Max
-43.527 -13.675 -2.692 10.316 95.583
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.02093 5.47303 6.764 3.54e-11 ***
Rest2 0.92575 0.07145 12.957 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.71 on 532 degrees of freedom
Multiple R-squared: 0.2399, Adjusted R-squared: 0.2385
F-statistic: 167.9 on 1 and 532 DF, p-value: < 2.2e-16
# base R:
plot(Ex2 ~ Rest2, data = heart_train, cex = .8, col = "blue",
     main = "Linear regression over heart data")
abline(lm_out)
# produce residual vs. fitted plot
plot(fitted(lm_out), resid(lm_out))
# add a horizontal line at 0
abline(0, 0)
# ggplot:
heart_train |>
  ggplot() +
  aes(x = Rest2, y = Ex2) +
  geom_point(color = 'blue') +
  geom_smooth(method = 'lm', color = 'darkorange') +
  ggtitle("Linear regression over heart data")
`geom_smooth()` using formula = 'y ~ x'
ggplot(lm_out, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ggtitle("Residuals of regression on heart data")
The residuals look like a shapeless blob, and the p-values for both the slope and intercept are extremely small, indicating that both are significantly different from zero.
Perform quadratic regression on the training set using lm and print out the summary. Make a separate plot of the residuals on the training set, and assess whether it looks good (like a shapeless blob). Based on the summary, explain which parameters are significantly different from zero, according to the hypothesis test, and report whether the fit is improved compared to the linear model.
quad_out <- lm(Ex2 ~ poly(Rest2, degree = 2, raw = TRUE), data = heart_train)
summary(quad_out)
Call:
lm(formula = Ex2 ~ poly(Rest2, degree = 2, raw = TRUE), data = heart_train)
Residuals:
Min 1Q Median 3Q Max
-44.453 -14.116 -2.879 10.041 96.495
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.218089 22.628083 -0.275 0.783580
poly(Rest2, degree = 2, raw = TRUE)1 2.051981 0.576402 3.560 0.000404 ***
poly(Rest2, degree = 2, raw = TRUE)2 -0.007154 0.003633 -1.969 0.049473 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.65 on 531 degrees of freedom
Multiple R-squared: 0.2454, Adjusted R-squared: 0.2426
F-statistic: 86.34 on 2 and 531 DF, p-value: < 2.2e-16
# base R:
plot(Ex2 ~ Rest2, data = heart_train, cex = .8, col = "blue",
     main = "Quadratic regression on heart rate data")
y_pred <- quad_out$coefficients[1] +
  quad_out$coefficients[2] * heart_train$Rest2 +
  quad_out$coefficients[3] * heart_train$Rest2^2
# sort by x so the curve is drawn left to right
ord <- order(heart_train$Rest2)
lines(heart_train$Rest2[ord], y_pred[ord], lwd = 3, col = 'darkorange')
# produce residual vs. fitted plot
plot(fitted(quad_out), resid(quad_out),
     main = "Residuals of quadratic regression on heart data")
# add a horizontal line at 0
abline(0, 0)
# ggplot:
heart_train |>
  ggplot() +
  aes(x = Rest2, y = Ex2) +
  geom_point(color = 'blue') +
  geom_smooth(method = 'lm', formula = 'y ~ poly(x, 2, raw = TRUE)',
              color = 'darkorange') +
  ggtitle("Quadratic regression")
ggplot(quad_out, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ggtitle("Residuals of quadratic regression on heart data")
The residuals still look like a shapeless blob, and the R-squared improves slightly (by 1-2% in most cases, depending on the random split). The parameters are usually significant, though the quadratic coefficient is very small.
Use the linear regression model parameters to compute the predicted values for the test set, and calculate the residuals for the test set. Print the variance of the residuals for the training and test sets, and compare them. Use the quadratic regression model to compute the predicted values for the test set using the quadratic regression coefficients, and calculate the residuals for the test set. Print the variance of the residuals for the training and test sets, and compare them. How did adding the quadratic parameter impact the error for the training and test sets?
print(paste("Residual variance from linear model for the training set:", var(lm_out$residuals)))
[1] "Residual variance from linear model for the training set: 387.611676616165"
y_pred <- lm_out$coefficients[1] + lm_out$coefficients[2] * heart_test$Rest2
print(paste("Residual variance from linear model for the test set:", var(y_pred - heart_test$Ex2)))
[1] "Residual variance from linear model for the test set: 369.884552666365"
print(paste("Residual variance from quadratic model for the training set:", var(quad_out$residuals)))
[1] "Residual variance from quadratic model for the training set: 384.802144846724"
y_pred <- quad_out$coefficients[1] +
  quad_out$coefficients[2] * heart_test$Rest2 +
  quad_out$coefficients[3] * heart_test$Rest2^2
print(paste("Residual variance from quadratic model for the test set:", var(y_pred - heart_test$Ex2)))
[1] "Residual variance from quadratic model for the test set: 360.537551724502"
Neither the training nor the test set residual variance improves dramatically when the quadratic term is added, but for most splits the training set improves more, and sometimes the error in the test set goes up, indicating mild overfitting (though nothing dramatic).
Redo the random split of the original data set into training and test sets, leaving progressively smaller fractions of the data in the training set, and calculate linear regression on the training set. Calculate and print the variance of the residuals of the training and test sets, and report whether you observe overfitting.
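The split-and-refit code that produced the output below is not shown in the rendered document; the loop might be sketched as follows. The `heart_clean` data frame here is a synthetic stand-in (a linear trend plus noise, roughly matching the fitted coefficients above), and `split_and_fit` is a hypothetical helper name, not from the original analysis.

```r
# Synthetic stand-in for the cleaned heart-rate data.
set.seed(1)
heart_clean <- data.frame(Rest2 = runif(1000, 50, 100))
heart_clean$Ex2 <- 40 + 0.9 * heart_clean$Rest2 + rnorm(1000, sd = 15)

# Split off a given fraction for training, fit, and return both residual variances.
split_and_fit <- function(df, train_frac) {
  idx   <- sample(nrow(df), size = round(train_frac * nrow(df)))
  train <- df[idx, ]
  test  <- df[-idx, ]
  fit   <- lm(Ex2 ~ Rest2, data = train)
  pred  <- predict(fit, newdata = test)
  c(train_var = var(resid(fit)), test_var = var(test$Ex2 - pred))
}

# Progressively smaller training fractions, as the prompt asks.
for (frac in c(0.5, 0.1, 0.05, 0.01)) {
  v <- split_and_fit(heart_clean, frac)
  print(paste0(100 * frac, "% training set: train var = ", round(v["train_var"], 1),
               ", test var = ", round(v["test_var"], 1)))
}
```

Because the split is random, the gap between training and test variance fluctuates from run to run, which is exactly the "luck of the draw" effect discussed below.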
Call:
lm(formula = Ex2 ~ Rest2, data = heart_train)
Residuals:
Min 1Q Median 3Q Max
-18.4794 -10.0917 0.0382 11.2450 15.1063
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.0002 36.6270 1.256 0.245
Rest2 0.6896 0.4555 1.514 0.168
Residual standard error: 13.05 on 8 degrees of freedom
Multiple R-squared: 0.2227, Adjusted R-squared: 0.1255
F-statistic: 2.292 on 1 and 8 DF, p-value: 0.1685
print(paste("Residual variance for the 1% training set", var(lm_out$residuals)))
[1] "Residual variance for the 1% training set 151.390262719779"
y_pred <- lm_out$coefficients[1] + lm_out$coefficients[2] * heart_test$Rest2
print(paste("Residual variance for the 99% test set", var(y_pred - heart_test$Ex2)))
[1] "Residual variance for the 99% test set 386.0053338943"
It really depends on the luck of the draw! For some random selections of training data at 5% or 10%, we observe overfitting (the error in the test set is significantly higher than in the training set), but not always. With a 1% split there is almost always overfitting, though occasionally it goes the other way; I believe this has to do with the outliers I didn't filter out.
Ecological data
The following data set contains observations of the populations of one species of fish (cutthroat trout) and two species of salamander in Mack Creek, Andrews Forest, Willamette National Forest, Oregon. The data set contains 16 variables and over thirty-two thousand observations. The variables include time and date, location, and measurements, such as size and weight. The metadata (descriptions of data) are provided here (click on “Data entities” tab for explanations of each variable.)
Rows: 32209 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): sitecode, section, reach, unittype, species, clip, notes
dbl (8): year, pass, unitnum, vert_index, pitnumber, length_1_mm, length_2_...
date (1): sampledate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You will use the numeric variables length_1_mm and weight_g. Examine the data visually to see if there are any missing values or outliers, clean the data to remove any values you want, and filter the data to contain observations from only one species: either 'Cutthroat trout' or 'Coastal giant salamander'. Split the remaining data into training and test sets of equal size.
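The cleaning, filtering, and splitting code is not shown in the rendered output; a minimal sketch is below. The tiny `mack` data frame is a synthetic stand-in for the real CSV, and the "non-missing, positive" filtering rule is an assumption about what was removed, not a record of the original cleaning choices.

```r
# Stand-in for the real data: in the assignment, `mack` comes from read_csv().
mack <- data.frame(
  species     = c(rep("Cutthroat trout", 6), rep("Coastal giant salamander", 2)),
  length_1_mm = c(120, 95, NA, 150, 110, 130, 80, 90),
  weight_g    = c(18, 11, 20, -9, 14, 22, 6, 7))   # -9 mimics a bad/sentinel value

# Keep one species and drop missing or non-positive measurements
# (an assumed cleaning rule, for illustration only).
mack_fish <- subset(mack, species == "Cutthroat trout" &
                          !is.na(length_1_mm) & !is.na(weight_g) &
                          length_1_mm > 0 & weight_g > 0)

set.seed(42)                               # reproducible split
idx        <- sample(nrow(mack_fish), size = nrow(mack_fish) %/% 2)
mack_train <- mack_fish[idx, ]
mack_test  <- mack_fish[-idx, ]
```

Filtering before splitting keeps the training and test sets on the same footing, so their residual variances are directly comparable later.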
Use linear regression on the training set and print out the summary. Make a scatterplot and overlay the linear regression model as a line on the same plot. Make a separate plot of the residuals on the training set, and assess whether it looks good (like a shapeless blob). Based on the summary, explain which parameters are significantly different from zero, according to the hypothesis test.
lm_out <- lm(weight_g ~ length_1_mm, data = mack_train)
summary(lm_out)
Call:
lm(formula = weight_g ~ length_1_mm, data = mack_train)
Residuals:
Min 1Q Median 3Q Max
-15.983 -2.860 -0.265 1.870 102.381
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.389375 0.132674 -93.38 <2e-16 ***
length_1_mm 0.256016 0.001473 173.81 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.12 on 6294 degrees of freedom
Multiple R-squared: 0.8276, Adjusted R-squared: 0.8276
F-statistic: 3.021e+04 on 1 and 6294 DF, p-value: < 2.2e-16
# base R:
plot(weight_g ~ length_1_mm, data = mack_train, cex = .8, col = "blue",
     main = "Linear regression on mack data")
abline(lm_out)
# produce residual vs. fitted plot
plot(fitted(lm_out), resid(lm_out), main = "Residuals of linear regression on mack data")
# add a horizontal line at 0
abline(0, 0)
# ggplot:
mack_train |>
  ggplot() +
  aes(x = length_1_mm, y = weight_g) +
  geom_point(color = 'blue') +
  geom_smooth(method = 'lm', color = 'darkorange') +
  ggtitle("Linear regression on mack data")
`geom_smooth()` using formula = 'y ~ x'
ggplot(lm_out, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ggtitle("Residuals of linear regression on mack data")
No, the residuals do not look like a shapeless blob! This indicates an essential nonlinearity in the data, so a linear model is not appropriate despite a high R-squared of over 80%. All parameters are “significant,” as indicated by their low p-values.
Perform quadratic regression on the training set using lm and print out the summary. Make a separate plot of the residuals on the training set, and assess whether it looks good (like a shapeless blob). Based on the summary, explain which parameters are significantly different from zero, according to the hypothesis test, and report whether the fit is improved compared to the linear model.
quad_out <- lm(weight_g ~ poly(length_1_mm, degree = 2, raw = TRUE), data = mack_train)
summary(quad_out)
Call:
lm(formula = weight_g ~ poly(length_1_mm, degree = 2, raw = TRUE),
data = mack_train)
Residuals:
Min 1Q Median 3Q Max
-81.025 -0.416 -0.011 0.298 100.063
Coefficients:
Estimate Std. Error t value
(Intercept) 5.017e+00 1.794e-01 27.96
poly(length_1_mm, degree = 2, raw = TRUE)1 -2.053e-01 4.363e-03 -47.06
poly(length_1_mm, degree = 2, raw = TRUE)2 2.568e-03 2.379e-05 107.92
Pr(>|t|)
(Intercept) <2e-16 ***
poly(length_1_mm, degree = 2, raw = TRUE)1 <2e-16 ***
poly(length_1_mm, degree = 2, raw = TRUE)2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.44 on 6293 degrees of freedom
Multiple R-squared: 0.9395, Adjusted R-squared: 0.9395
F-statistic: 4.887e+04 on 2 and 6293 DF, p-value: < 2.2e-16
# base R:
plot(weight_g ~ length_1_mm, data = mack_train, cex = .8, col = "blue",
     main = "Quadratic regression on mack data")
y_pred <- quad_out$coefficients[1] +
  quad_out$coefficients[2] * mack_train$length_1_mm +
  quad_out$coefficients[3] * mack_train$length_1_mm^2
# sort by x so the curve is drawn left to right
ord <- order(mack_train$length_1_mm)
lines(mack_train$length_1_mm[ord], y_pred[ord], lwd = 3, col = 'darkorange')
# produce residual vs. fitted plot
plot(fitted(quad_out), resid(quad_out), main = "Residuals of quadratic regression on mack data")
# add a horizontal line at 0
abline(0, 0)
# ggplot:
mack_train |>
  ggplot() +
  aes(x = length_1_mm, y = weight_g) +
  geom_point(color = 'blue') +
  geom_smooth(method = 'lm', formula = 'y ~ poly(x, 2, raw = TRUE)',
              color = 'darkorange') +
  ggtitle("Quadratic regression on mack data")
ggplot(quad_out, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ggtitle("Residuals of quadratic regression on mack data")
The residuals now look like a shapeless blob, and the R-squared improves considerably (by more than 10 percentage points). All the parameters are still significant.
Use the linear regression model parameters to compute the predicted values for the test set, and calculate the residuals for the test set. Print the variance of the residuals for the training and test sets, and compare them. Use the quadratic regression model to compute the predicted values for the test set using the quadratic regression coefficients, and calculate the residuals for the test set. Print the variance of the residuals for the training and test sets, and compare them. How did adding the quadratic parameter impact the error for the training and test sets?
print(paste("Residual variance for the linear model on the training set", var(lm_out$residuals)))
[1] "Residual variance for the linear model on the training set 16.9709086146212"
y_pred <- lm_out$coefficients[1] + lm_out$coefficients[2] * mack_test$length_1_mm
print(paste("Residual variance for the linear model on the test set", var(y_pred - mack_test$weight_g)))
[1] "Residual variance for the linear model on the test set 15.8478334939268"
print(paste("Residual variance for the quadratic model on the training set", var(quad_out$residuals)))
[1] "Residual variance for the quadratic model on the training set 5.95334225461604"
y_pred <- quad_out$coefficients[1] +
  quad_out$coefficients[2] * mack_test$length_1_mm +
  quad_out$coefficients[3] * mack_test$length_1_mm^2
print(paste("Residual variance for quadratic model on the test set", var(y_pred - mack_test$weight_g)))
[1] "Residual variance for quadratic model on the test set 3.8948956296826"
The error improved dramatically for both the training and the test sets with the addition of the quadratic parameter!
Perform a cubic polynomial fit on the same data, print out the summary, and produce the same plots as before. Then compute and print the variance of the residuals for the training and test sets, and compare them. How did adding the cubic parameter impact the error for the training and test sets?
cub_out <- lm(weight_g ~ poly(length_1_mm, degree = 3, raw = T), data = mack_train)
summary(cub_out)
Call:
lm(formula = weight_g ~ poly(length_1_mm, degree = 3, raw = T),
data = mack_train)
Residuals:
Min 1Q Median 3Q Max
-63.560 -0.567 0.046 0.476 99.954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.170e+00 3.942e-01 23.26 <2e-16
poly(length_1_mm, degree = 3, raw = T)1 -3.634e-01 1.407e-02 -25.82 <2e-16
poly(length_1_mm, degree = 3, raw = T)2 4.303e-03 1.489e-04 28.90 <2e-16
poly(length_1_mm, degree = 3, raw = T)3 -5.723e-06 4.850e-07 -11.80 <2e-16
(Intercept) ***
poly(length_1_mm, degree = 3, raw = T)1 ***
poly(length_1_mm, degree = 3, raw = T)2 ***
poly(length_1_mm, degree = 3, raw = T)3 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.414 on 6292 degrees of freedom
Multiple R-squared: 0.9408, Adjusted R-squared: 0.9408
F-statistic: 3.335e+04 on 3 and 6292 DF, p-value: < 2.2e-16
# base R:
plot(weight_g ~ length_1_mm, data = mack_train, cex = .8, col = "blue",
     main = "Cubic regression on mack data")
# sort by x so the fitted curve is drawn left to right
ord <- order(mack_train$length_1_mm)
lines(mack_train$length_1_mm[ord], fitted(cub_out)[ord], lwd = 3, col = 'darkorange')
# produce residual vs. fitted plot
plot(fitted(cub_out), resid(cub_out), main = "Residuals of cubic regression on mack data")
# add a horizontal line at 0
abline(0, 0)
# ggplot:
mack_train |>
  ggplot() +
  aes(x = length_1_mm, y = weight_g) +
  geom_point(color = 'blue') +
  geom_smooth(method = 'lm', formula = 'y ~ poly(x, degree = 3, raw = TRUE)',
              color = 'darkorange') +
  ggtitle("Cubic regression on mack data")
cub_out |>
  ggplot() +
  aes(x = .fitted, y = .resid) +
  geom_point() +
  geom_hline(yintercept = 0) +
  ggtitle("Residuals of cubic regression on mack data")
print(paste("Residual variance for the cubic model on the training set", var(cub_out$residuals)))
[1] "Residual variance for the cubic model on the training set 5.82443895879213"
y_pred <- cub_out$coefficients[1] +
  cub_out$coefficients[2] * mack_test$length_1_mm +
  cub_out$coefficients[3] * mack_test$length_1_mm^2 +
  cub_out$coefficients[4] * mack_test$length_1_mm^3
print(paste("Residual variance for cubic model on the test set", var(y_pred - mack_test$weight_g)))
[1] "Residual variance for cubic model on the test set 4.18022557773365"
Adding a cubic term improves the error for the training set by a small amount, but the error for the test set goes up compared to the quadratic model: a sign of overfitting, though a modest one.